<!-- DISABLE-FRONTMATTER-SECTIONS -->

# End-of-chapter quiz[[end-of-chapter-quiz]]

<CourseFloatingBanner
    chapter={6}
    classNames="absolute z-10 right-0 top-0"
/>

Let's test what you learned in this chapter!

### 1. When should you train a new tokenizer?

<Question
	choices={[
		{
			text: "When your dataset is similar to that used by an existing pretrained model, and you want to pretrain a new model",
			explain: "In this case, to save time and compute resources, a better choice would be to use the same tokenizer as the pretrained model and fine-tune that model instead."
		},
		{
			text: "When your dataset is similar to that used by an existing pretrained model, and you want to fine-tune a new model using this pretrained model",
			explain: "To fine-tune a model from a pretrained model, you should always use the same tokenizer."
		},
		{
			text: "When your dataset is different from the one used by an existing pretrained model, and you want to pretrain a new model",
			explain: "Correct! In this case there's no advantage to using the same tokenizer.",
            correct: true
		},
        {
			text: "When your dataset is different from the one used by an existing pretrained model, but you want to fine-tune a new model using this pretrained model",
			explain: "To fine-tune a model from a pretrained model, you should always use the same tokenizer."
		}
	]}
/>

### 2. What is the advantage of using a generator of lists of texts compared to a list of lists of texts when using `train_new_from_iterator()`?

<Question
	choices={[
		{
			text: "That's the only type the method <code>train_new_from_iterator()</code> accepts.",
			explain: "A list of lists of texts is a particular kind of generator of lists of texts, so the method will accept this too. Try again!"
		},
		{
			text: "You will avoid loading the whole dataset into memory at once.",
			explain: "Right! Each batch of texts will be released from memory when you iterate, and the gain will be especially visible if you use 🤗 Datasets to store your texts.",
			correct: true
		},
		{
			text: "This will allow the 🤗 Tokenizers library to use multiprocessing.",
			explain: "No, it will use multiprocessing either way."
		},
        {
			text: "The tokenizer you train will generate better texts.",
			explain: "The tokenizer does not generate text -- are you confusing it with a language model?"
		}
	]}
/>

### 3. What are the advantages of using a "fast" tokenizer?

<Question
	choices={[
		{
			text: "It can process inputs faster than a slow tokenizer when you batch lots of inputs together.",
			explain: "Correct! Thanks to parallelism implemented in Rust, it will be faster on batches of inputs. What other benefit can you think of?",
			correct: true
		},
		{
			text: "Fast tokenizers always tokenize faster than their slow counterparts.",
			explain: "A fast tokenizer can actually be slower when you only give it one or very few texts, since it can't use parallelism."
		},
		{
			text: "It can apply padding and truncation.",
			explain: "True, but slow tokenizers also do that."
		},
        {
			text: "It has some additional features allowing you to map tokens to the span of text that created them.",
			explain: "Indeed -- those are called offset mappings. That's not the only advantage, though.",
			correct: true
		}
	]}
/>

### 4. How does the `token-classification` pipeline handle entities that span over several tokens?

<Question
	choices={[
		{
			text: "The entities with the same label are merged into one entity.",
			explain: "That's oversimplifying things a little. Try again!"
		},
		{
			text: "There is a label for the beginning of an entity and a label for the continuation of an entity.",
			explain: "Correct!",
			correct: true
		},
		{
			text: "In a given word, as long as the first token has the label of the entity, the whole word is considered labeled with that entity.",
			explain: "That's one strategy to handle entities. What other answers here apply?",
			correct: true
		},
        {
			text: "When a token has the label of a given entity, any other following token with the same label is considered part of the same entity, unless it's labeled as the start of a new entity.",
			explain: "That's the most common way to group entities together -- it's not the only right answer, though.",
			correct: true
		}
	]}
/>

### 5. How does the `question-answering` pipeline handle long contexts?

<Question
	choices={[
		{
			text: "It doesn't really, as it truncates the long context at the maximum length accepted by the model.",
			explain: "There is a trick you can use to handle long contexts. Do you remember what it is?"
		},
		{
			text: "It splits the context into several parts and averages the results obtained.",
			explain: "No, it wouldn't make sense to average the results, as some parts of the context won't include the answer."
		},
		{
			text: "It splits the context into several parts (with overlap) and finds the maximum score for an answer in each part.",
			explain: "That's the correct answer!",
			correct: true
		},
        {
			text: "It splits the context into several parts (without overlap, for efficiency) and finds the maximum score for an answer in each part.",
			explain: "No, it includes some overlap between the parts to avoid a situation where the answer would be split across two parts."
		}
	]}
/>

### 6. What is normalization?

<Question
	choices={[
		{
			text: "It's any cleanup the tokenizer performs on the texts in the initial stages.",
			explain: "That's correct -- for instance, it might involve removing accents or whitespace, or lowercasing the inputs.",
			correct: true
		},
		{
			text: "It's a data augmentation technique that involves making the text more normal by removing rare words.",
			explain: "That's incorrect! Try again."
		},
		{
			text: "It's the final post-processing step where the tokenizer adds the special tokens.",
			explain: "That stage is simply called post-processing."
		},
        {
			text: "It's when the embeddings are made with mean 0 and standard deviation 1, by subtracting the mean and dividing by the std.",
			explain: "That process is commonly called normalization when applied to pixel values in computer vision, but it's not what normalization means in NLP."
		}
	]}
/>

### 7. What is pre-tokenization for a subword tokenizer?

<Question
	choices={[
		{
			text: "It's the step before the tokenization, where data augmentation (like random masking) is applied.",
			explain: "No, that step is part of the preprocessing."
		},
		{
			text: "It's the step before the tokenization, where the desired cleanup operations are applied to the text.",
			explain: "No, that's the normalization step."
		},
		{
			text: "It's the step before the tokenizer model is applied, to split the input into words.",
			explain: "That's the correct answer!",
			correct: true
		},
        {
			text: "It's the step before the tokenizer model is applied, to split the input into tokens.",
			explain: "No, splitting into tokens is the job of the tokenizer model."
		}
	]}
/>

### 8. Select the sentences that apply to the BPE model of tokenization.

<Question
	choices={[
		{
			text: "BPE is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.",
			explain: "That's the case indeed!",
			correct: true
		},
		{
			text: "BPE is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.",
			explain: "No, that's the approach taken by a different tokenization algorithm."
		},
		{
			text: "BPE tokenizers learn merge rules by merging the pair of tokens that is the most frequent.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "A BPE tokenizer learns a merge rule by merging the pair of tokens that maximizes a score that privileges frequent pairs with less frequent individual parts.",
			explain: "No, that's the strategy applied by another tokenization algorithm."
		},
		{
			text: "BPE tokenizes words into subwords by splitting them into characters and then applying the merge rules.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "BPE tokenizes words into subwords by finding the longest subword starting from the beginning that is in the vocabulary, then repeating the process for the rest of the text.",
			explain: "No, that's another tokenization algorithm's way of doing things."
		},
	]}
/>

### 9. Select the sentences that apply to the WordPiece model of tokenization.

<Question
	choices={[
		{
			text: "WordPiece is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.",
			explain: "That's the case indeed!",
			correct: true
		},
		{
			text: "WordPiece is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.",
			explain: "No, that's the approach taken by a different tokenization algorithm."
		},
		{
			text: "WordPiece tokenizers learn merge rules by merging the pair of tokens that is the most frequent.",
			explain: "No, that's the strategy applied by another tokenization algorithm."
		},
		{
			text: "A WordPiece tokenizer learns a merge rule by merging the pair of tokens that maximizes a score that privileges frequent pairs with less frequent individual parts.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "WordPiece tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.",
			explain: "No, that's how another tokenization algorithm works."
		},
		{
			text: "WordPiece tokenizes words into subwords by finding the longest subword starting from the beginning that is in the vocabulary, then repeating the process for the rest of the text.",
			explain: "Yes, this is how WordPiece proceeds for the encoding.",
			correct: true
		},
	]}
/>

### 10. Select the sentences that apply to the Unigram model of tokenization.

<Question
	choices={[
		{
			text: "Unigram is a subword tokenization algorithm that starts with a small vocabulary and learns merge rules.",
			explain: "No, that's the approach taken by a different tokenization algorithm."
		},
		{
			text: "Unigram is a subword tokenization algorithm that starts with a big vocabulary and progressively removes tokens from it.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "Unigram adapts its vocabulary by minimizing a loss computed over the whole corpus.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "Unigram adapts its vocabulary by keeping the most frequent subwords.",
			explain: "No, this incorrect."
		},
		{
			text: "Unigram tokenizes words into subwords by finding the most likely segmentation into tokens, according to the model.",
			explain: "That's correct!",
			correct: true
		},
		{
			text: "Unigram tokenizes words into subwords by splitting them into characters, then applying the merge rules.",
			explain: "No, that's how another tokenization algorithm works."
		},
	]}
/>
